{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COMPSCI 389: Introduction to Machine Learning\n", "# Topic 2.1: Pandas and Data Sets\n", "\n", "This notebook describes how data sets are represented and manipulated using the `pandas` library." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is pandas?\n", "\n", "Pandas stands for \"PANel DAta,\" an econometric term for data sets. Webpage: [link](https://pandas.pydata.org/docs/index.html).\n", "\n", "It provides two main objects: a **DataFrame** and a **Series**.\n", "\n", "A DataFrame object stores a 2-dimensional table of data, while a Series stores a 1-dimensional vector of data.\n", "\n", "Pandas provides useful functions for working with these objects, including functions for:\n", "1. Loading data sets from files and storing them in DataFrame and/or Series objects.\n", "2. Manipulating DataFrame and Series objects (e.g., adding or removing features).\n", "3. Computing statistics of the data (e.g., the minimum and maximum values of features).\n", "\n", "Pandas has become so common that many other ML libraries in Python are built to be compatible with pandas, as we will see below.\n", "\n", "To install pandas, run the following command in the console or command line:\n", "\n", "> pip install pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Data Sets\n", "\n", "In the remainder of this notebook we load and inspect a few example data sets for supervised learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GPA Data\n", "\n", "The GPA data set contains data about undergraduate students at the *Universidade Federal do Rio Grande do Sul* (UFRGS) in Brazil.\n", "\n", "**Input**: Scores on 9 entrance exams: \n", "1. Physics\n", "2. Biology\n", "3. History\n", "4. English\n", "5. Geography\n", "6. Literature\n", "7. Portuguese\n", "8. Math\n", "9. 
Chemistry\n", "\n", "**Output**: GPA on a 4.0 scale during the first three semesters at university.\n", " - The GPA can be used for regression (predict the GPA) or classification (predict the GPA range, e.g., whether it is at least 3.0).\n", "\n", "**Data set size**: 43,303 rows\n", "\n", "Let's start by loading and displaying this data set. The data set is available here:\n", "\n", "[https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv](https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv)\n", "\n", "You can either download it into a directory called `data` next to this .ipynb file and load the local copy, or load it directly from the URL:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>physics</th><th>biology</th><th>history</th><th>English</th><th>geography</th><th>literature</th><th>Portuguese</th><th>math</th><th>chemistry</th><th>gpa</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>622.60</td><td>491.56</td><td>439.93</td><td>707.64</td><td>663.65</td><td>557.09</td><td>711.37</td><td>731.31</td><td>509.80</td><td>1.33333</td></tr>\n",
"    <tr><th>1</th><td>538.00</td><td>490.58</td><td>406.59</td><td>529.05</td><td>532.28</td><td>447.23</td><td>527.58</td><td>379.14</td><td>488.64</td><td>2.98333</td></tr>\n",
"    <tr><th>2</th><td>455.18</td><td>440.00</td><td>570.86</td><td>417.54</td><td>453.53</td><td>425.87</td><td>475.63</td><td>476.11</td><td>407.15</td><td>1.97333</td></tr>\n",
"    <tr><th>3</th><td>756.91</td><td>679.62</td><td>531.28</td><td>583.63</td><td>534.42</td><td>521.40</td><td>592.41</td><td>783.76</td><td>588.26</td><td>2.53333</td></tr>\n",
"    <tr><th>4</th><td>584.54</td><td>649.84</td><td>637.43</td><td>609.06</td><td>670.46</td><td>515.38</td><td>572.52</td><td>581.25</td><td>529.04</td><td>1.58667</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>43298</th><td>519.55</td><td>622.20</td><td>660.90</td><td>543.48</td><td>643.05</td><td>579.90</td><td>584.80</td><td>581.25</td><td>573.92</td><td>2.76333</td></tr>\n",
"    <tr><th>43299</th><td>816.39</td><td>851.95</td><td>732.39</td><td>621.63</td><td>810.68</td><td>666.79</td><td>705.22</td><td>781.01</td><td>831.76</td><td>3.81667</td></tr>\n",
"    <tr><th>43300</th><td>798.75</td><td>817.58</td><td>731.98</td><td>648.42</td><td>751.30</td><td>648.67</td><td>662.05</td><td>773.15</td><td>835.25</td><td>3.75000</td></tr>\n",
"    <tr><th>43301</th><td>527.66</td><td>443.82</td><td>545.88</td><td>624.18</td><td>420.25</td><td>676.80</td><td>583.41</td><td>395.46</td><td>509.80</td><td>2.50000</td></tr>\n",
"    <tr><th>43302</th><td>512.56</td><td>415.41</td><td>517.36</td><td>532.37</td><td>592.30</td><td>382.20</td><td>538.35</td><td>448.02</td><td>496.39</td><td>3.16667</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>43303 rows × 10 columns</p>\n",
"</div>
" ], "text/plain": [ " physics biology history English geography literature Portuguese \\\n", "0 622.60 491.56 439.93 707.64 663.65 557.09 711.37 \n", "1 538.00 490.58 406.59 529.05 532.28 447.23 527.58 \n", "2 455.18 440.00 570.86 417.54 453.53 425.87 475.63 \n", "3 756.91 679.62 531.28 583.63 534.42 521.40 592.41 \n", "4 584.54 649.84 637.43 609.06 670.46 515.38 572.52 \n", "... ... ... ... ... ... ... ... \n", "43298 519.55 622.20 660.90 543.48 643.05 579.90 584.80 \n", "43299 816.39 851.95 732.39 621.63 810.68 666.79 705.22 \n", "43300 798.75 817.58 731.98 648.42 751.30 648.67 662.05 \n", "43301 527.66 443.82 545.88 624.18 420.25 676.80 583.41 \n", "43302 512.56 415.41 517.36 532.37 592.30 382.20 538.35 \n", "\n", " math chemistry gpa \n", "0 731.31 509.80 1.33333 \n", "1 379.14 488.64 2.98333 \n", "2 476.11 407.15 1.97333 \n", "3 783.76 588.26 2.53333 \n", "4 581.25 529.04 1.58667 \n", "... ... ... ... \n", "43298 581.25 573.92 2.76333 \n", "43299 781.01 831.76 3.81667 \n", "43300 773.15 835.25 3.75000 \n", "43301 395.46 509.80 2.50000 \n", "43302 448.02 496.39 3.16667 \n", "\n", "[43303 rows x 10 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd # Import pandas\n", "\n", "# Load the data set directly from the online link, assuming numbers are separated by commas\n", "df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv\", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas\n", "\n", "# Load the data set from a local `data` directory, assuming numbers are separated by commas\n", "# df = pd.read_csv(\"data/GPA.csv\", delimiter=',')\n", "\n", "# print(df) # Prints a string representation of the DataFrame\n", "display(df) # Renders an HTML table (for Jupyter Notebooks - don't use in .py file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: Is each column numerical, categorical, text, or an image? 
Continuous, discrete, nominal, or ordinal?\n", "\n", "**Answer**: All of these columns are numerical and continuous.\n", "\n", "**Question**: If the GPAs were binned into letter grades A, B, C, ..., F, would they be numerical, categorical, text, or an image? Continuous, discrete, nominal, or ordinal?\n", "\n", "**Answer**: In this case the GPAs would be categorical, and specifically ordinal.\n", "\n", "Notice that pandas views this as a table with rows and columns. Hence features *and* labels are viewed as \"columns\" when using pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manipulating DataFrames\n", "\n", "In this section we give some examples of how DataFrames can be used to compute statistics of data and how DataFrames can be manipulated.\n", "\n", "First, let's use the `iloc` indexer in pandas (integer-location based indexing, which selects rows and columns by position) to split the data set into the input features $X$ and the targets/labels $y$. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>physics</th><th>biology</th><th>history</th><th>English</th><th>geography</th><th>literature</th><th>Portuguese</th><th>math</th><th>chemistry</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>622.60</td><td>491.56</td><td>439.93</td><td>707.64</td><td>663.65</td><td>557.09</td><td>711.37</td><td>731.31</td><td>509.80</td></tr>\n",
"    <tr><th>1</th><td>538.00</td><td>490.58</td><td>406.59</td><td>529.05</td><td>532.28</td><td>447.23</td><td>527.58</td><td>379.14</td><td>488.64</td></tr>\n",
"    <tr><th>2</th><td>455.18</td><td>440.00</td><td>570.86</td><td>417.54</td><td>453.53</td><td>425.87</td><td>475.63</td><td>476.11</td><td>407.15</td></tr>\n",
"    <tr><th>3</th><td>756.91</td><td>679.62</td><td>531.28</td><td>583.63</td><td>534.42</td><td>521.40</td><td>592.41</td><td>783.76</td><td>588.26</td></tr>\n",
"    <tr><th>4</th><td>584.54</td><td>649.84</td><td>637.43</td><td>609.06</td><td>670.46</td><td>515.38</td><td>572.52</td><td>581.25</td><td>529.04</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>43298</th><td>519.55</td><td>622.20</td><td>660.90</td><td>543.48</td><td>643.05</td><td>579.90</td><td>584.80</td><td>581.25</td><td>573.92</td></tr>\n",
"    <tr><th>43299</th><td>816.39</td><td>851.95</td><td>732.39</td><td>621.63</td><td>810.68</td><td>666.79</td><td>705.22</td><td>781.01</td><td>831.76</td></tr>\n",
"    <tr><th>43300</th><td>798.75</td><td>817.58</td><td>731.98</td><td>648.42</td><td>751.30</td><td>648.67</td><td>662.05</td><td>773.15</td><td>835.25</td></tr>\n",
"    <tr><th>43301</th><td>527.66</td><td>443.82</td><td>545.88</td><td>624.18</td><td>420.25</td><td>676.80</td><td>583.41</td><td>395.46</td><td>509.80</td></tr>\n",
"    <tr><th>43302</th><td>512.56</td><td>415.41</td><td>517.36</td><td>532.37</td><td>592.30</td><td>382.20</td><td>538.35</td><td>448.02</td><td>496.39</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>43303 rows × 9 columns</p>\n",
"</div>
" ], "text/plain": [ " physics biology history English geography literature Portuguese \\\n", "0 622.60 491.56 439.93 707.64 663.65 557.09 711.37 \n", "1 538.00 490.58 406.59 529.05 532.28 447.23 527.58 \n", "2 455.18 440.00 570.86 417.54 453.53 425.87 475.63 \n", "3 756.91 679.62 531.28 583.63 534.42 521.40 592.41 \n", "4 584.54 649.84 637.43 609.06 670.46 515.38 572.52 \n", "... ... ... ... ... ... ... ... \n", "43298 519.55 622.20 660.90 543.48 643.05 579.90 584.80 \n", "43299 816.39 851.95 732.39 621.63 810.68 666.79 705.22 \n", "43300 798.75 817.58 731.98 648.42 751.30 648.67 662.05 \n", "43301 527.66 443.82 545.88 624.18 420.25 676.80 583.41 \n", "43302 512.56 415.41 517.36 532.37 592.30 382.20 538.35 \n", "\n", " math chemistry \n", "0 731.31 509.80 \n", "1 379.14 488.64 \n", "2 476.11 407.15 \n", "3 783.76 588.26 \n", "4 581.25 529.04 \n", "... ... ... \n", "43298 581.25 573.92 \n", "43299 781.01 831.76 \n", "43300 773.15 835.25 \n", "43301 395.46 509.80 \n", "43302 448.02 496.39 \n", "\n", "[43303 rows x 9 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0 1.33333\n", "1 2.98333\n", "2 1.97333\n", "3 2.53333\n", "4 1.58667\n", " ... \n", "43298 2.76333\n", "43299 3.81667\n", "43300 3.75000\n", "43301 2.50000\n", "43302 3.16667\n", "Name: gpa, Length: 43303, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "X = df.iloc[:, :-1] # All columns except the last as features. This creates a new DataFrame X.\n", "print(type(X)) # Confirm that this is actually a new DataFrame by printing the type of X.\n", "y = df.iloc[:, -1] # The last column contains the labels. This creates a new Series (like a 1-dimensional DataFrame) y\n", "display(X) # Display the input columns\n", "display(y) # Display the output (label) column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the variable `y` displays differently from `X`. 
This is because `y` is a Series (1-dimensional vector), while `X` is a DataFrame (2-dimensional matrix/table).\n", "\n", "Also, in the output of the above block, `float64` means that each element in the `y` Series is a floating point number represented with 64 bits." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }